Using K-Means Clustering to Determine Groupings of Fatal Car Crashes

Kyle Knuth
Siva Gopavarapu
Nikhitha Karlapudi
Eswar Sai Burugupalley
Bala Venkataprasad Sadhana Vittal

2024-04-11

Intro to K-means clustering: (Unsupervised Technique)

  • K-means clustering partitions data into clusters so that points within a cluster are as similar as possible and points in different clusters are as dissimilar as possible. 

  • Methods for optimal cluster number: Elbow, Silhouette, and Gap Statistic. 

  • Preprocessing techniques, such as the Canopy algorithm, can be applied to large datasets. 

  • Cluster quality assessed using Calinski-Harabasz criterion. 

  • K-means process: iterative centroid recalculation for cluster convergence. 

  • Efficiency improved by using PCA for initial centroid estimation. 
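The Calinski-Harabasz criterion mentioned above can be computed directly from the output of R's `kmeans`; below is a minimal sketch on simulated two-group data (illustrative only, not the crash data analyzed later):

```r
set.seed(7)
# two well-separated simulated groups of 25 two-dimensional points each
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 4), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 10)

# Calinski-Harabasz: between-cluster dispersion over within-cluster
# dispersion, each scaled by its degrees of freedom
n <- nrow(x)
k <- 2
ch <- (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
ch  # large values indicate compact, well-separated clusters
```

Comparing this value across candidate numbers of clusters gives a quantitative check of cluster quality.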

Applications

  • K-means clustering applied in various fields: 

  • Color segmentation in fruit disease identification. 

  • Geographic clustering for aquifer risk assessment and regional classification in Indonesia. 

  • Medical applications: predicting chemotherapy response, analyzing skin changes with age, pregnancy problem analysis, and allergen sensitivity prediction. 

  • Tourism industry: clustering tourist preferences to enhance satisfaction. 

  • Agriculture: soil classification and spatial representation improvement. 

  • Economic and health behavior clustering for county-level analysis. 

Variations:

  • Constrained K-means clustering imposes lower bounds on cluster sizes and on the distances between clusters, potentially deleting or consolidating clusters based on these constraints.

  • An improved K-means algorithm enhances speed and efficiency by using two data structures to store cluster labels and distances, reducing repeated distance calculations.

  • The filtering algorithm optimizes Lloyd's K-means using a kd-tree structure, improving efficiency and scalability to higher dimensions.

  • K-means clustering can be integrated with MapReduce from Hadoop for efficient data processing: distances between centroids and data points are computed in parallel, and centroids are updated iteratively until convergence.

  • Extended K-means considers cluster sizes by sorting clusters based on the number of elements and recruits or drops elements from clusters accordingly. 

Limitations: 

  • Random initialization of centroids. 

  • Requires prior knowledge of the number of clusters. 

  • Difficulty handling diverse data types. 

  • Choosing incorrect number of clusters can introduce variability. 

  • Issues arise with complex data and scalability. 

  • Requires careful preprocessing and adequate sample size. 
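The random-initialization limitation is commonly mitigated in R with the `nstart` argument of `kmeans`, which runs multiple random starts and keeps the solution with the lowest within-cluster sum of squares. A small sketch on simulated data:

```r
set.seed(123)
# three simulated groups
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2),
           matrix(rnorm(60, mean = 8), ncol = 2))

# nstart = 25: try 25 random initializations and return the best fit
km <- kmeans(x, centers = 3, nstart = 25)
km$tot.withinss  # total within-cluster sum of squares of the best start
```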

Methods

The methodology behind k-means clustering involves preprocessing the data, picking variables of interest, determining the optimal number of clusters, performing the k-means clustering, and identifying what each cluster consists of. The data is typically preprocessed by first removing any categorical variables, then removing outliers, and finally performing feature scaling. Once the data has been preprocessed, variables of interest are selected and the optimal number of clusters is determined.

Silhouette Coefficient

The first method for determining how many clusters to use is the Silhouette Coefficient. The Silhouette Coefficient measures how similar a data point is to its own cluster compared to other clusters, on a range of -1 to 1, where a more positive value means the point is well placed within its cluster [@shi2021quantitative].

\[ S=\frac{b-a}{\max(b,\:a)} \]

S: Silhouette coefficient

a: the average distance from a data point to the other points within the same cluster

b: the minimum average distance from one point to points in other clusters
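As a sketch, the average silhouette width can be computed with the `cluster` package's `silhouette` function, which returns the per-point value of S above (toy simulated data, not the Fatalities data):

```r
library(cluster)  # provides silhouette()

set.seed(42)
# two clearly separated simulated groups
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 10)

# one silhouette value per point: (b - a) / max(a, b)
sil <- silhouette(km$cluster, dist(x))
mean(sil[, "sil_width"])  # near 1 when points sit well inside their clusters
```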

Elbow Method

The next method of determining how many clusters to use is the Elbow method. The Elbow method consists of plotting the within-cluster sum of squares against the number of clusters and visually selecting a point where the rate of decrease significantly slows down [@shi2021quantitative].

\[ WCSS=\sum^{K}_{k=1}\sum_{x_i\in C_k}\lVert x_i-\mu_k\rVert^2 \]

WCSS: within-cluster sum of squares

\(x_i\): data point in cluster \(C_k\)

\(C_k\): the current individual cluster

K: number of total clusters

\(\mu_k\): centroids corresponding to the cluster
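A minimal elbow-method sketch in R: compute the WCSS (reported by `kmeans` as `tot.withinss`) for a range of k and look for the point where the decrease flattens (simulated data for illustration):

```r
set.seed(1)
# three simulated groups
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 5), ncol = 2),
           matrix(rnorm(60, mean = 10), ncol = 2))

# WCSS for k = 1..8; nstart = 25 reduces sensitivity to random starts
wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters k", ylab = "WCSS")
```

On data like this, the curve drops steeply up to the true number of groups and flattens afterwards, producing the "elbow".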

Gap Statistic

The final method for determining how many clusters to use is the Gap statistic. This method finds the optimal number of clusters by comparing the observed within-cluster dispersion to the dispersion expected under a null reference distribution [@tibshirani2001estimating].

\[ \mathrm{Gap}_n(k)=E^*_n\{\log(W_k)\}-\log(W_k) \]

\(\mathrm{Gap}_n(k)\): Gap statistic

n: sample size

k: number of clusters

\(W_k\): within-cluster sum of squares around the cluster means

\(E^*_n\): expectation under a reference null distribution, for a sample of size n
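The gap statistic is implemented in the `cluster` package as `clusGap`; a small sketch on simulated two-group data (B is kept small here for speed):

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(1)
# two well-separated simulated groups
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 6), ncol = 2))

# B reference datasets are drawn from a null (uniform) distribution
gap <- clusGap(x, FUNcluster = kmeans, nstart = 10, K.max = 5, B = 25)

# choose k from the gap curve using the one-standard-error rule
k_hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"])
k_hat
```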

Euclidean Distance

The distance metric used for these methods is Euclidean distance. Euclidean distance is defined as the straight line distance between two points [@ultsch2022euclidean].

\[ d(x,y)=\sqrt{\sum_i|x_i-y_i|^2} \]

d(x,y): Euclidean distance

x: the first data point

y: the second data point
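For example, the formula agrees with R's built-in `dist`, whose default method is Euclidean:

```r
x <- c(1, 2, 3)
y <- c(4, 6, 3)

# straight-line distance from the formula: sqrt(9 + 16 + 0) = 5
d_manual <- sqrt(sum((x - y)^2))

# same result from dist(); its default method is "euclidean"
d_builtin <- as.numeric(dist(rbind(x, y)))
```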

K-Means process

The actual method of k-means clustering is an iterative process consisting of the following steps [@9320839]:

  • Determine the number of clusters

  • Randomly select initial centroids, one for each cluster

  • Using Euclidean distance, assign each data point to the cluster with the nearest centroid

  • Recalculate each centroid as the mean of the data points assigned to its cluster

  • Repeat the previous two steps until the centroids no longer change
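The iterative process above can be sketched from scratch in a few lines of R. This is a teaching sketch (the hypothetical `lloyd_kmeans` below does not handle empty clusters); in practice the built-in `kmeans()` should be used:

```r
lloyd_kmeans <- function(X, k, max_iter = 100) {
  # randomly select k data points as the initial centroids
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  for (i in seq_len(max_iter)) {
    # assign each point to its nearest centroid (squared Euclidean distance)
    d <- sapply(seq_len(k), function(j) colSums((t(X) - centers[j, ])^2))
    cl <- max.col(-d)
    # recompute each centroid as the mean of its assigned points
    new_centers <- t(sapply(seq_len(k),
                            function(j) colMeans(X[cl == j, , drop = FALSE])))
    # stop once the centroids no longer change
    if (max(abs(new_centers - centers)) < 1e-8) break
    centers <- new_centers
  }
  list(cluster = cl, centers = centers)
}

set.seed(2)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
fit <- lloyd_kmeans(x, k = 2)
table(fit$cluster)
```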

Assumptions

The assumptions of k-means clustering are as follows [@raykov2016k]:

  • The data in each cluster lies in a sphere around the centroid, and each sphere has the same radius

  • The average of the points in a cluster is the centroid of the cluster meaning that outliers will dramatically affect clustering

  • The density of each cluster is consistent

  • The number of clusters is known and fixed

Dataset

The “Fatalities” dataset comes from the AER R package and originally came from the US Department of Transportation Fatal Accident Reporting System. It covers 1982-1988 and includes all states except Alaska and Hawaii.

Link to Fatalities data documentation

Variable Variable Type Description
state Categorical A categorical variable that represents the state
year Categorical The year is indicated by the categorical factor
spirits Continuous A numerical value indicating the amount of spirits consumed.
unemp Continuous The unemployment rate expressed as a numerical value.
income Continuous A numerical value in dollars that represents the per capita personal income. 
emppop Continuous The employment to population ratio expressed as a numerical value.
beertax Continuous The tax on a case of beer expressed as a numerical value.
baptist Continuous A number that indicates the proportion of Southern Baptists in the general population.
mormon Continuous A numerical value that indicates the proportion of Mormons in the general population.
drinkage Continuous A number that represents the minimum legal drinking age. 
dry Continuous A numerical value that indicates the proportion of people living in “dry” counties. 
youngdrivers Continuous A numerical value indicating the proportion of drivers between the ages of 15 and 24. 
miles Continuous A numerical value indicating the average number of miles driven by each driver.
breath Boolean A categorical factor that denotes the existence of a law requiring a preliminary breath test. 
jail Boolean A categorical factor that indicates whether a jail sentence is required. 
service Boolean A categorical factor that denotes the existence of a community service requirement. 
fatal Continuous A number that indicates the overall number of people killed in cars. 
nfatal Continuous A numerical value that indicates the quantity of car deaths that occur at night.
sfatal Continuous A numerical value that indicates the number of fatalities in single-vehicle crashes.
fatal1517 Continuous A numerical value that indicates how many teenagers between the ages of 15 and 17 were killed in cars.
nfatal1517 Continuous A numerical value that indicates how many teenagers, aged 15 to 17, are killed in cars at night.
fatal1820 Continuous A numerical value that indicates the total number of 18–20-year-olds who die in cars.
nfatal1820 Continuous A numerical value that indicates how many teenagers, aged 18 to 20, are killed in cars at night.
fatal2124 Continuous A numerical value that indicates how many people aged 21 to 24 are killed in cars.
nfatal2124 Continuous A numerical value that indicates how many people aged 21 to 24 are killed in cars at night.
afatal Continuous A numerical value that indicates the quantity of drunk driving deaths.
pop Continuous A numerical figure that indicates the entire population.
pop1517 Continuous A numerical value that indicates the population of people aged 15 to 17.
pop1820 Continuous A numerical value that denotes the 18–20 year old population.
pop2124 Continuous A numerical value that denotes the population of people aged 21 to 24.
milestot Continuous A numerical value (in millions) that indicates the total number of vehicle miles traveled.
unempus Continuous The US unemployment rate expressed as a numerical value. 
emppopus Continuous The US employment to population ratio expressed as a numerical value. 
gsp Continuous A numerical value that indicates the Gross State Product’s (GSP) rate of change.
head(df)
  state year spirits unemp   income   emppop  beertax baptist  mormon drinkage
1    al 1982    1.37  14.4 10544.15 50.69204 1.539379 30.3557 0.32829    19.00
2    al 1983    1.36  13.7 10732.80 52.14703 1.788991 30.3336 0.34341    19.00
3    al 1984    1.32  11.1 11108.79 54.16809 1.714286 30.3115 0.35924    19.00
4    al 1985    1.28   8.9 11332.63 55.27114 1.652542 30.2895 0.37579    19.67
5    al 1986    1.23   9.8 11661.51 56.51450 1.609907 30.2674 0.39311    21.00
6    al 1987    1.18   7.8 11944.00 57.50988 1.560000 30.2453 0.41123    21.00
      dry youngdrivers    miles breath jail service fatal nfatal sfatal
1 25.0063     0.211572 7233.887     no   no      no   839    146     99
2 22.9942     0.210768 7836.348     no   no      no   930    154     98
3 24.0426     0.211484 8262.990     no   no      no   932    165     94
4 23.6339     0.211140 8726.917     no   no      no   882    146     98
5 23.4647     0.213400 8952.854     no   no      no  1081    172    119
6 23.7924     0.215527 9166.302     no   no      no  1110    181    114
  fatal1517 nfatal1517 fatal1820 nfatal1820 fatal2124 nfatal2124  afatal
1        53          9        99         34       120         32 309.438
2        71          8       108         26       124         35 341.834
3        49          7       103         25       118         34 304.872
4        66          9       100         23       114         45 276.742
5        82         10       120         23       119         29 360.716
6        94         11       127         31       138         30 368.421
      pop  pop1517  pop1820  pop2124 milestot unempus emppopus         gsp
1 3942002 208999.6 221553.4 290000.1    28516     9.7     57.8 -0.02212476
2 3960008 202000.1 219125.5 290000.2    31032     9.6     57.9  0.04655825
3 3988992 197000.0 216724.1 288000.2    32961     7.5     59.5  0.06279784
4 4021008 194999.7 214349.0 284000.3    35091     7.2     60.1  0.02748997
5 4049994 203999.9 212000.0 263000.3    36259     7.0     60.7  0.03214295
6 4082999 204999.8 208998.5 258999.8    37426     6.2     61.5  0.04897637
  • Number of rows and columns

    dim(df)
    [1] 336  34
  • Null and missing values

    print(c(sum(is.na(df)),sum(is.null(df))))
    [1] 2 0

Data Preprocessing

  1. Spirits

  2. Beertax

  3. Baptist

  4. Mormon

  5. Afatal

We believe that the above variables help to cluster the car crashes related to alcohol consumption.

df=df[c(3,7,8,9,26,27)] # keep pop (column 27) for the percentage-to-count conversion

Convert the percentage of Baptists and Mormons to a count, then drop pop

df$baptist=(df$baptist/100)*df$pop
df$mormon=(df$mormon/100)*df$pop
df$pop=NULL

Check if there are any missing or null values in our data

print(c(sum(is.na(df)),sum(is.null(df))))
[1] 0 0
describe(df)
        vars   n      mean        sd   median   trimmed      mad    min
spirits    1 336      1.75      0.68     1.67      1.67     0.53   0.79
beertax    2 336      0.51      0.48     0.35      0.42     0.24   0.04
baptist    3 336 353164.70 557450.81 58039.21 239427.34 84461.06   0.00
mormon     4 336  60798.39 160724.27 17809.97  24459.64 17199.78 930.33
afatal     5 336    293.33    303.58   211.59    234.89   200.14  24.60
               max      range skew kurtosis       se
spirits       4.90       4.11 2.20     6.63     0.04
beertax       2.72       2.68 2.17     5.14     0.03
baptist 2879464.51 2879464.51 2.26     6.24 30411.44
mormon  1049096.92 1048166.59 5.04    26.72  8768.23
afatal     2094.90    2070.30 2.70     9.28    16.56

The five graphs below show a boxplot of each variable of interest. As seen below, there are some outliers that need to be removed so we can cluster our data effectively.

    Spirits

    boxplot(df$spirits, horizontal = TRUE, xlab="Count", ylab="Spirits")

    Beertax

    boxplot(df$beertax, horizontal = TRUE, xlab="Count", ylab="Beertax")

    Baptist

    boxplot(df$baptist, horizontal = TRUE, xlab="Count", ylab="Baptist")

    Mormon

    boxplot(df$mormon, horizontal = TRUE, xlab="Count", ylab="Mormon")

    Afatal

    boxplot(df$afatal, horizontal = TRUE, xlab="Count", ylab="Fatal Car Crashes (Alcohol)")

    First, check if there are any outliers in our data and remove them before scaling the data. We need to remove the outliers since they can affect the centroids in the k-means clustering process. In addition, the data needs to be scaled since k-means clustering relies on Euclidean distance, and large differences in variable scales can adversely affect it.

    detection=function(x) {
      q1=quantile(x,probs=.25)
      q3=quantile(x,probs=.75)
      iqr=q3-q1
      x>q3+(iqr*1.5)|x<q1-(iqr*1.5)
    }
    elimination=function(df,cols=names(df)) {
      for(col in cols) {
        df=df[!detection(df[[col]]),]
      }
      return(df)
    }
    column_names=c('spirits','baptist','mormon','beertax','afatal')
    df2=df
    cleaned_df_2=elimination(df2,column_names)
    cleaned_df2=as.data.frame(scale(cleaned_df_2))
    boxplot(cleaned_df2)


    Optimal Number of Clusters

    The elbow method is performed and, as shown by the graph, the optimal number of clusters is 6.

    fviz_nbclust(cleaned_df2, kmeans, method = "wss")

    According to this graph, the optimal number of clusters could be either 10 or 6, since these values have the highest average silhouette widths.

    fviz_nbclust(cleaned_df2, kmeans, method = "silhouette")

    Lastly, we use the gap statistic as a final check to determine the best number of clusters. This graph also shows 6 as the optimal number of clusters.

    fviz_nbclust(cleaned_df2, kmeans, method = "gap_stat")

    Modeling with K-means Clustering

    Building the model using our variables of interest, with the number of clusters set to 6.

    custom_colors=c("red","green","yellow","orange","blue","cyan")
    km <- kmeans(cleaned_df2, centers=6, nstart=25)
    fviz_cluster(km, data=cleaned_df2, geom="point") +
      scale_color_manual(values=custom_colors) +
      scale_fill_manual(values=custom_colors)

    Results:

    Two cluster tables are created to show the average characteristics of each cluster: the first in the original units, the second in standardized (scaled) units.

    # A tibble: 6 × 6
      cluster spirits beertax baptist mormon afatal
        <int>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>
    1       1    1.47   0.930 865674. 22263.   323.
    2       2    1.72   0.237  58579. 40587.   117.
    3       3    1.39   0.501  22168.  6693.   113.
    4       4    2.18   0.245  18809.  6921.   142.
    5       5    1.60   0.270 100785. 24841.   473.
    6       6    1.24   0.350 842165. 14902.   295.
    # A tibble: 6 × 6
      cluster spirits beertax baptist mormon afatal
        <int>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>
    1       1  -0.456   1.85    1.60   0.392  0.584
    2       2   0.137  -0.617  -0.517  1.72  -0.723
    3       3  -0.635   0.321  -0.612 -0.734 -0.750
    4       4   1.20   -0.589  -0.621 -0.717 -0.567
    5       5  -0.149  -0.503  -0.406  0.578  1.54 
    6       6  -0.980  -0.218   1.53  -0.140  0.409
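Per-cluster summary tables like the two above can be produced by averaging each variable within each cluster assignment; a sketch with base R's `aggregate` on toy data (the variable names here are illustrative, not from the Fatalities dataset):

```r
set.seed(3)
toy <- data.frame(a = rnorm(30), b = rnorm(30))

# cluster on the scaled data, then summarize in the original units
km <- kmeans(scale(toy), centers = 3, nstart = 25)
res <- aggregate(toy, by = list(cluster = km$cluster), FUN = mean)
res  # one row per cluster with the mean of each variable
```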

    Cluster Analysis (average)

    • Cluster 1: This cluster is characterized by an average consumption of 1.47 alcoholic drinks per person, with 323 alcohol-related car crashes on average. The average beer tax is relatively high at 93 cents. It predominantly comprises Baptists rather than Mormons.

    • Cluster 2: Cluster 2 exhibits a slightly higher average consumption of alcoholic drinks per person (1.72) compared to Cluster 1, with a lower average of 117 alcohol-related car crashes. The average beer tax is notably lower at 23.7 cents. Relative to the dataset averages, this cluster has a notably large Mormon population.

    • Cluster 3: In Cluster 3, the average consumption of alcoholic drinks per person is 1.39, with 113 alcohol-related car crashes on average. The average beer tax falls in between the previous clusters at 50.1 cents. This cluster has similar proportions of Baptists and Mormons.

    Cluster Analysis (average)

    • Cluster 4: This cluster demonstrates the highest average consumption of alcoholic drinks per person (2.18) among the clusters, with 142 alcohol-related car crashes on average. The average beer tax is similar to Cluster 2 at 24.5 cents. Similar to Cluster 3, it contains similar amounts of Baptists and Mormons.

    • Cluster 5: Cluster 5 has an average consumption of 1.6 alcoholic drinks per person, with 473 alcohol-related car crashes on average, the highest among the clusters. The average beer tax is 27 cents. Relative to the dataset averages, this cluster has a relatively large Mormon population.

    • Cluster 6: Finally, Cluster 6 displays the lowest average consumption of alcoholic drinks per person (1.24), with 295 alcohol-related car crashes on average. The average beer tax is 35 cents. Significantly, this cluster contains more Baptists than Mormons.

    Cluster analysis (by variable)

    In this first chart, cluster 5 had the highest number of alcohol-related car crash fatalities, while clusters 2 and 3 had the lowest. Cluster 1 had the second highest.

    In this next chart, cluster 4 had the highest alcohol consumption while cluster 6 had the lowest. The second highest was cluster 2.

    Cluster analysis (by variable)

    As shown below, clusters 1 and 6 had the highest population of Baptists while clusters 3 and 4 had the lowest amount.

    In this chart, cluster 2 had the highest population of Mormons and clusters 3 and 4 had the lowest population.

    Cluster analysis (by variable)

    Cluster 1 had the highest beertax while clusters 2 and 4 had the lowest beertax.

    Cluster analysis (state)

    The graphs below show the same information as the bar charts but at the state level. We created a subset of the original data, now labeled with the clusters from our k-means clustering analysis, to make inferences. This highlights which states belong to which cluster.

    Cluster analysis (state)

    State map based on clusters

    Below is the map in which states are divided based on clusters

    Conclusion

    • The analysis of the “Fatalities” dataset, sourced from the US Department of Transportation’s Fatal Accident Reporting System, was conducted to understand fatal car crashes.

    • K-means clustering was utilized as an effective method for grouping data to extract insights.

    • The analysis process involved data pre-processing, determining the optimal number of clusters, and implementing the K-means clustering model.

    • Six clusters were generated to elucidate trends related to alcohol-related car crashes, considering factors such as alcohol consumption, religion, and beer tax.

    • The clusters helped to create subregions within the United States, each exhibiting distinct characteristics and providing insights into alcohol handling and its impacts.

    • Insights derived from the analysis can aid in identifying potential areas for targeted interventions regarding alcohol-related issues.

    • We acknowledge the subjectivity involved in selecting the appropriate number of clusters, which introduces potential variability in the results and their accuracy.

    • The decision-making process for determining the optimal number of clusters was supported by the Elbow and Silhouette methods, but the decision became easier after observing the Gap statistic.